Based on 2022 Singapore HDB resale price (real-life) data sets, your team is supposed to construct a multiple regression model (for one particular district) to explain the HDB resale price (ResalePrice) in dollars with the given independent variables.
Documentation and Presentation: 10 marks
Methodology: 10 marks
R-codes, computer outputs interpretation and graphical explanations: 15 marks
Recommendations and conclusions: 15 marks
Written Report: PDF Format: Within 10pages excluding the cover page and Appendix
Appendix: codes with computer outputs
You are required to provide the detailed documentation of how you search your recommended model for inference purpose and justify each step in your data analysis. You are also expected to provide model assumption justification and hypothesis testing evidences (R-codes and computer outputs) with clear explanations that your recommended model is the best model among all the models considered according to BIC criterion. Based on your final recommended model, state clearly your recommendations and conclusions.
Load Data In
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Sengkang <- read.csv("data/Sengkang2023P.csv", stringsAsFactors = TRUE)
head(Sengkang, 5)
## Date Type Block Street Story Area Model LeaseBegin LeaseRemain
## 1 2022-01 3 ROOM 331C ANCHORVALE ST 10 TO 12 67 Model A 2015 92 years 09 months
## 2 2022-01 3 ROOM 209C COMPASSVALE LANE 16 TO 18 67 Model A 2011 88 years 11 months
## 3 2022-01 3 ROOM 211C COMPASSVALE LANE 10 TO 12 68 Model A 2013 90 years 02 months
## 4 2022-01 3 ROOM 467B FERNVALE LINK 04 TO 06 68 Model A 2016 93 years 08 months
## 5 2022-01 3 ROOM 414B FERNVALE LINK 07 TO 09 68 Model A 2016 93 years 01 month
## ResalePrice
## 1 420000
## 2 418000
## 3 410000
## 4 382000
## 5 410000
From the regression above, we know that there are too many categorical variables to consider. We have to condense them accordingly.
length(unique(Sengkang$Block))
## [1] 521
length(unique(Sengkang$Street))
## [1] 29
length(unique(Sengkang$LeaseRemain))
## [1] 226
We can see from the code chunk above that block will have 521, street has 29 and LeaseRemain has 226 categorical variables
We will use the lubridate package to adjust the year and calculate lease years used as a numeric rather than a categorical variable. Since years_used and LeaseRemain are perfected correlated, we will drop LeaseRemaind from the dataframe.
site, the HDB blocks are numbered by 100+, 200+, 300+ and 400+ in Rivervale, Compassvale, Anchorvale and Fernvale respectively.
df <- Sengkang %>%
mutate(Date = lubridate::ym(Date),
LeaseBegin = lubridate::ym( paste0(LeaseBegin,"-01")),
years_used = as.numeric((Date - LeaseBegin)/365),
subzone = ifelse(grepl("^1", Block), "Rivervale",
ifelse(grepl("^2", Block), "Compassvale",
ifelse(grepl("^3", Block), "Anchorvale",
ifelse(grepl("4", Block), "Fernvale", "others")))),
.before = Street) %>%
mutate(subzone= as.factor(subzone)) %>%
select(-LeaseRemain, -LeaseBegin)
Using leaseremain as a factor will generate too many binary variables. Convert them into years_used would be easier. Date and LeaseBegin variables must be in date type before substracting between the two. The output would be in (drtn) days and thus we have to set it to numeric set to years.
We will first calculate the BIC of the regression of all variables.
reg_all <- lm(ResalePrice ~ ., data = df)
BIC(reg_all)
## [1] 52020.52
We have three types of location columns now, Block, Street and subzone. There should be high correlation between the X variables for these 3 variables, we will test the BIC number by dropping each variable out and picking the model with the lowest BIC
L1 <- lm(ResalePrice ~ .-Block, data = df)
L2 <- lm(ResalePrice ~ .-Street,
data = df)
L3 <- lm(ResalePrice ~ .-subzone,
data = df)
It would seem removing Block would be the best. Now we will compare between street and subzone ## Removing Block and Street
L4 <- lm(ResalePrice ~ .-Block-Street,
data = df)
L5 <- lm(ResalePrice ~ .-Block-subzone, data = df)
BIC_location <- data.frame(lm = c(".-Block",".-Street",".-subzone",".-Block-Street",".-Block-subzone"),
BIC = c(BIC(L1),BIC(L2),BIC(L3),BIC(L4),BIC(L5)))
BIC_location
## lm BIC
## 1 .-Block 49828.63
## 2 .-Street 52020.52
## 3 .-subzone 52020.52
## 4 .-Block-Street 50903.32
## 5 .-Block-subzone 49875.08
plot(L1)
plot(df$Date, df$ResalePrice)
plot(df$Type, df$ResalePrice)
plot(df$Block, df$ResalePrice)
plot(df$years_used, df$ResalePrice)
plot(df$subzone, df$ResalePrice)
plot(df$Street, df$ResalePrice)
plot(df$Story, df$ResalePrice)
plot(df$Area, log(df$ResalePrice))
plot(df$Model, df$ResalePrice)
residual_plot <- function(x,regression){
n <- names(x)
r <- residuals(regression)
for (i in 1:length(x)){
plot(x$n[i], r)
}
}
residual_plot(df, L1)
r <- residuals(L1)
plot(df$Date, r,
xlab = "Date", ylab = "Residuals")
plot(df$Type, r,
xlab = "Type", ylab = "Residuals")
plot(df$Block, r,
xlab = "Block", ylab = "Residuals")
plot(df$years_used, r,
xlab = "years_used", ylab = "Residuals")
plot(df$subzone, r,
xlab = "subzone", ylab = "Residuals")
plot(df$Street, r,
xlab = "Street", ylab = "Residuals")
plot(df$Story, r,
xlab = "Story", ylab = "Residuals")
plot(df$Area, r,
xlab = "Area", ylab = "Residuals")
plot(df$Model, r,
xlab = "Model", ylab = "Residuals")